This is the second installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.
In this notebook, I demonstrate logistic regression using the Titanic competition.
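As a quick refresher, logistic regression passes a linear combination of the predictors through the sigmoid function to produce a probability of the positive class. Here is a minimal NumPy sketch of the hypothesis as taught in the course; the function names are illustrative and are not taken from the helper modules used below.

import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # P(y = 1 | x; theta) = g(theta' x) for each row of X
    return sigmoid(np.dot(X, theta))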
In [1188]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import statsmodels.api as sm
import code.Linear_Regression_Funcs as LRF
import code.Logistic_Regression_Funcs as LGF
In [1189]:
reload(LGF)  # re-import the helper module to pick up any edits (Python 2 builtin)
Out[1189]:
In [1190]:
train = pd.read_csv("./data/titanic/train.csv", index_col="PassengerId")
train.head()
Out[1190]:
In [1191]:
# Fill embarkation location NaNs with a string
#train.Embarked = train.Embarked.fillna('nan')
# Create name category from titles in the name column
#train = LGF.nametitles(train)
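LGF.nametitles is defined in the helper module rather than shown in this notebook. A plausible sketch of what it does is to extract the honorific ("Mr", "Mrs", "Miss", ...) embedded in each Name value into its own categorical column; the regex and the 'Title' column name here are assumptions:

def nametitles(df):
    # Names look like "Braund, Mr. Owen Harris"; grab the token
    # ending with a period as the passenger's title (hypothetical sketch)
    df = df.copy()
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.')
    return df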
In [1192]:
#temp = pd.crosstab([train.Pclass, train.Sex],train.Survived.astype(bool))
#temp
In [1193]:
#sb.set(style="white")
#sb.factorplot('Pclass','Survived','Sex',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Pclass',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Sex',data=train,palette="muted")
#fg = sb.FacetGrid(train,hue="Pclass",aspect=3,palette="muted")
#fg.map(sb.kdeplot,"Age",bw=4,shade=True,legend=True)
#fg.set(xlim=(0,80))
In [1194]:
## Transform categorical variables into numeric indicators (for examination only)
#temp = LGF.cat2indicator(train, ['Embarked','Pclass','Sex']) # Embarkation, Class, Sex
#
## Examine data grouped by survival
#temp.groupby(temp.Survived).describe()
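LGF.cat2indicator also lives in the helper module. It presumably expands each categorical column into 0/1 indicator (dummy) columns, which pandas can do directly; a minimal sketch, with the body an assumption:

def cat2indicator(df, cols):
    # Replace each listed column with 0/1 indicator columns,
    # one per category level (e.g. Sex -> Sex_female, Sex_male)
    return pd.get_dummies(df, columns=cols)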
In [1195]:
y = train['Survived']
In [1196]:
# X is an [m x n] matrix.
# m = number of observations
# n = number of predictors
X = LGF.make_matrix(train)
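LGF.make_matrix is where the feature engineering lives; the module source is not reproduced in this notebook. A rough sketch of what it presumably does: select predictors, dummy-encode the categoricals, add an intercept, and, for the test set, align columns to the training matrix. The specific predictor list below is an assumption:

def make_matrix(df, matchcols=None):
    # Dummy-encode a few plausible categorical predictors (assumed list)
    X = pd.get_dummies(df[['Pclass', 'Sex', 'Embarked']],
                       columns=['Pclass', 'Sex', 'Embarked'])
    if matchcols is not None:
        # Test set: align to the training columns, filling any
        # category level absent from the test data with zeros
        X = X.reindex(columns=matchcols, fill_value=0)
    return sm.add_constant(X)  # prepend the intercept column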
In [1214]:
results = sm.Logit(y,X).fit(maxiter=1000,method='bfgs')
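Note that sm.Logit.fit defaults to Newton's method with a much smaller iteration limit; 'bfgs' with a generous maxiter is presumably chosen here because dummy-coded predictors can leave the design matrix close to separable, which makes the default solver prone to convergence failures.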
In [1198]:
results.summary()
Out[1198]:
In [1212]:
ypredict = results.predict(X)
ypredict = np.round(ypredict)  # threshold predicted probabilities at 0.5
print "score on training data = ", LGF.score(y, ypredict)
In [1200]:
# Read the test data
test = pd.read_csv("./data/titanic/test.csv",index_col="PassengerId")
In [1201]:
# Construct test model matrix, matching the training predictors
# (X.columns[1:] skips the intercept column, which make_matrix re-adds)
Xtest = LGF.make_matrix(test, matchcols=X.columns[1:])
In [1202]:
# Calculate predictions by applying model parameters to test model matrix
Ypredict = pd.DataFrame(results.predict(Xtest),index=Xtest.index)
Ypredict = np.round(Ypredict)
Ypredict.columns = ['Survived']
Ypredict = Ypredict.astype(int)
Ypredict.to_csv('./predictions/Logistic_Regression_Prediction.csv',sep=',')
This submission scored 0.77512, placing 1332nd out of 2075 submissions. That is the same score as the "My First Random Forest" benchmark.